In computer programming, machine code is computer program consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). For conventional binary number, machine code is the binaryOn nonbinary machines it is, e.g., a decimal representation. representation of a computer program that is actually read and interpreted by the computer. A program in machine code consists of a sequence of machine instructions (possibly interspersed with data).
Each machine code instruction causes the CPU to perform a specific task. Examples of such tasks include:
In general, each architecture family (e.g., x86, ARM) has its own instruction set architecture (ISA), and hence its own specific machine code language. There are exceptions, such as the VAX architecture, which includes optional support of the PDP-11 instruction set; the IA-64 architecture, which includes optional support of the IA-32 instruction set; and the PowerPC 615 microprocessor, which can natively process both PowerPC and x86 instruction sets.
Machine code is a strictly numerical language, and it is the lowest-level interface to the CPU intended for a programmer. Assembly language provides a direct map between the numerical machine code and a human-readable mnemonic. In assembly, numerical machine code and operands are replaced with mnemonics and labels. For example, the x86 architecture has available the 0x90 opcode; it is represented as NOP in the assembly source code. While it is possible to write programs directly in machine code, managing individual bits and calculating numerical memory address is tedious and error-prone. Therefore, programs are rarely written directly in machine code. However, an existing machine code program may be edited if the assembly source code is not available.
The majority of programs today are written in a high-level language. A high-level program may be translated into machine code by a compiler.
A processor's instruction set needs to execute the circuits of a computer's Logic level. At the digital level, the program needs to control the computer's registers, bus, memory, ALU, and other hardware components. To control a computer's architectural features, machine instructions are created. Examples of features that are controlled using machine instructions:
The criteria for instruction formats include:
Determining the size of the address field is a choice between space and speed. On some computers, the number of bits in the address field may be too small to access all of the physical memory. Also, virtual address space needs to be considered. Another constraint may be a limitation on the size of registers used to construct the address. Whereas a shorter address field allows the instructions to execute more quickly, other physical properties need to be considered when designing the instruction format.
Instructions can be separated into two types: general-purpose and special-purpose. Special-purpose instructions exploit architectural features that are unique to a computer. General-purpose instructions control architectural features common to all computers.
General-purpose instructions control:
For all but the IBM 7094 and 7094 II, there are three index registers designated A, B and C; indexing with multiple 1 bits in the tag subtracts the logical or of the selected index registers and loading with multiple 1 bits in the tag loads all of the selected index registers. The 7094 and 7094 II have seven index registers, but when they are powered on they are in multiple tag mode, in which they use only the three of the index registers in a fashion compatible with earlier machines, and require a Leave Multiple Tag Mode ( LMTM) instruction in order to access the other four index registers.
The effective address is normally Y-C(T), where C(T) is either 0 for a tag of 0, the logical or of the selected index registers in multiple tag mode or the selected index register if not in multiple tag mode. However, the effective address for index register control instructions is just Y.
A flag with both bits 1 selects indirect addressing; the indirect address word has both a tag and a Y field.
In addition to transfer (branch) instructions, these machines have skip instruction that conditionally skip one or two words, e.g., Compare Accumulator with Storage (CAS) does a three way compare and conditionally skips to NSI, NSI+1 or NSI+2, depending on the result.
6 5 5 5 5 6 bits [ op | rs | rt | rd |shamt| funct] R-type [ op | rs | rt | address/immediate] I-type [ op | target address ] J-type
rs, rt, and rd indicate register operands; shamt gives a shift amount; and the address or immediate fields contain an operand directly.
For example, adding the registers 1 and 2 and placing the result in register 6 is encoded:
[ op | rs | rt | rd |shamt| funct] 0 1 2 6 0 32 decimal 000000 00001 00010 00110 00000 100000 binary
Load a value into register 8, taken from the memory cell 68 cells after the location listed in register 3:
[ op | rs | rt | address/immediate] 35 3 8 68 decimal 100011 00011 01000 00000 00001 000100 binary
[ op | target address ] 2 1024 decimal 000010 00000 00000 00000 10000 000000 binary
In the 1970s and 1980s, overlapping instructions were sometimes used to preserve memory space. One example were in the implementation of error tables in Microsoft's Altair BASIC, where interleaved instructions mutually shared their instruction bytes. The technique is rarely used today, but might still be necessary to resort to in areas where extreme optimization for size is necessary on byte-level such as in the implementation of which have to fit into .
It is also sometimes used as a code obfuscation technique as a measure against disassembly and tampering.
The principle is also used in shared code sequences of fat binaries which must run on multiple instruction-set-incompatible processor platforms.
This property is also used to find unintended instructions called gadgets in existing code repositories and is used in return-oriented programming as alternative to code injection for exploits such as return-to-libc attacks.
Machine code and assembly code are sometimes called native code when referring to platform-dependent parts of language features or libraries.
The CPU knows what machine code to execute, based on its internal program counter. The program counter points to a memory address and is changed based on special instructions which may cause programmatic branches. The program counter is typically set to a hard coded value when the CPU is first powered on, and will hence execute whatever machine code happens to be at this address.
Similarly, the program counter can be set to execute whatever machine code is at some arbitrary address, even if this is not valid machine code. This will typically trigger an architecture specific protection fault.
The CPU is oftentimes told, by page permissions in a paging based system, if the current page actually holds machine code by an execute bit — pages have multiple such permission bits (readable, writable, etc.) for various housekeeping functionality. E.g. on Unix-like systems memory pages can be toggled to be executable with the system call, and on Windows, can be used to achieve a similar result. If an attempt is made to execute machine code on a non-executable page, an architecture specific fault will typically occur. Treating data as machine code, or finding new ways to use existing machine code, by various techniques, is the basis of some security vulnerabilities.
Similarly, in a segment based system, segment descriptors can indicate whether a segment can contain executable code and in what protection ring that code can run.
From the point of view of a process, the code space is the part of its address space where the code in execution is stored. In multitasking systems this comprises the program's code segment and usually shared libraries. In multi-threading environment, different threads of one process share code space along with data space, which reduces the overhead of context switching considerably as compared to process switching.
Machine code may also be decoded to high-level language under two conditions. The first condition is to accept an obfuscated reading of the source code. An obfuscated version of source code is displayed if the machine code is sent to a decompiler of the source language. The second condition requires the machine code to have information about the source code encoded within. The information includes a symbol table that contains . The symbol table may be stored within the executable, or it may exist in separate files. A debugger can then read the symbol table to help the programmer interactively debugging the machine code in execution.
|
|